1 Date/time data: introduction
1.1 Learning Objectives
- Learn about Build a date object by providing individual components (year, month, day) using
{clock}and thedate_build()function. - Create a date-time object by providing individual components (year, month, day, hour, minute, second) using
{clock}and thedate_time_build()function.
1.2 Overview
As a data analyst, being able to work with temporal data (that is data associated with dates and times) is a key skill to have.
Temporal data can be inputted in several ways. For example, when recording the date a patient visited the clinic, one nurse could input the date this way - Monday, 3rd January, 2022 while another nurse could record the same date as 03/01/2022. Thus, if our aim is to find out how many patients visited the hospital on a particular day, the data must be cleaned and the columns containing the date and/or time information must be reformatted.
1.2.1 Date and time classes in R
Here are some fundamental things to know about working dates and times in R.
R has special classes to deal with date and/or time data. We are going to discuss three of them.
The
dateclass for dates (which can be formatted into different configurations of year-month-day.The
timeclass for times (hours, minutes and seconds)The
date-timeclass also called POSIXct used to represent date plus time that is to pin point a distinct moment in time on a particular day in specified time zone.
For example, to get the current date and time in your local time zone, R has two in-built functions.
Sys.Date() for the current date and Sys.time() for the current date-time in the time zone on your computer.
Sys.time()## [1] "2022-06-22 12:35:44 CEST"
Sys.Date()## [1] "2022-06-22"
1.3 Creating date and times
Clock, is an R package that has many functions that make working with dates and times easier.
To install {clock}, run the following code in your R session.
install.packages("clock").
Proceed to load the packages if you have already installed them.
1.3.1 The date_build() function
We are going to use the function date_build() in the {clock} package to date objects with the birthdays given in the report above.
clock::date_build() is a function used to create a date by providing the following arguments.
The first argument provided is the year. If only one argument is given, the date output is the first day of the first month of that year.
For example, run the code below, to quickly get the date- the first of January, 2022.
clock::date_build(2022)## [1] "2022-01-01"
You can specify the year, month and day by providing three integer values separated by commas. For example, the output of the following code is 15th March, 2022.
clock::date_build(2022,3,15)## [1] "2022-03-15"
The second argument of the
clock::date_build()function has a range of 1 to 12 to represent the 12 months (January to December).The third argument of this function has a range of 1 to 31 to represent the number of possible days in a month.
Thus, providing values outside these ranges would be invalid. For example, the code below gives an error because there is no 14th month or 40th day of the year 2007.
try(clock::date_build(2007,14))## Error : `month` must be within the range of [1, 12], not 14.
try(clock::date_build(2007,12,40))## Error : `day` must be within the range of [1, 31], not 40.
You can create a range of dates using the colon : operator between the range you want either in the year, month or day arguments.
For example, to represent the first week of February 2002, in place of the third argument (day) in clock::date_build(year, month, day),type 1:7 as shown below.
date_build(2002, 2, 1:7)## [1] "2002-02-01" "2002-02-02" "2002-02-03" "2002-02-04" "2002-02-05" "2002-02-06" "2002-02-07"
Also, to get the last day of each month the year 2015, you can type the following code in place of clock::date_build(year,month, day) .
The month argument is replaced by the range
1:12Then, the day argument is replaced by the string
"last"
clock::date_build(2015,1:12,"last")## [1] "2015-01-31" "2015-02-28" "2015-03-31" "2015-04-30" "2015-05-31" "2015-06-30" "2015-07-31" "2015-08-31" "2015-09-30" "2015-10-31"
## [11] "2015-11-30" "2015-12-31"
Supposing you are analyzing birth records from a hospital in Nairobi, Kenya in the report below.
A healthy infant male was born on 27th December 2021 at 6 am. The next day, 3 baby girls were delivered. They were born in 1 hour intervals from 3pm.
- Create date objects from this information and write code to output the baby boy’s correct birthday.
1.3.2 The date_time_build() function
To create a date-time object, a R object that contains both dates and times, we use the function date_time_build() with seven arguments (year, month, day, hour, minute, second, zone).
A key thing, to note is how to input the arguments of date_time_build(). As discussed previously, month and day have specified ranges, so do hour, minute, second and zone.
The hour argument ranges from 0-23
The minute argument ranges from 0-59
The second argument ranges from 0-59
The zone is the local timezone written in the form “continent/city” eg. “America/New York” or “Africa/Lagos”. You can check out the list of all time zones with associated cities by viewing this inbuilt R data set
OlsonNames()
head(OlsonNames()) #Get first 6 rows of the timezones data set## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
For example, to create a date-time object of the boy born on 27th December, 2021 at 6 am in Nairobi, Kenya we should run the following code.
clock::date_time_build(2021,12,27,06,00,00, zone="Africa/Nairobi")## [1] "2021-12-27 06:00:00 EAT"
- Create a data object for the birthdays of the three baby girls born the following day after 27th December, 2021 at 3pm with one hour interval between them.
##
## 1 2
## Yowza! OH!-OH! This is frankly striking!
Contributors
The following team members contributed to this lesson:
References
Some material in this lesson was adapted from the following sources:
Wickham, Hadley, and Garrett Grolemund. “R For Data Science.” 16 Dates and times | R for Data Science, 2017. https://r4ds.had.co.nz/dates-and-times.html#time-zones.
Vaughan, Davis. “Building: Date-Time - date_time_build.” - date_time_build • clock, February 12, 2021. https://clock.r-lib.org/reference/date_time_build.html.
2 Date/time data: disaggregating dates
2.1 Learning Objectives
1. Group temporal data into different categories (days, weeks, months and years)
When working with data is important for data analysts to understand that data aggregation can hide a lot of things.
For example, when you rely on monthly case counts to tell you the entire story about the development of new disease cases, there is a lot granular data you are not taking advantage of.
Disaggregating (separating) monthly data into weekly case counts and even daily counts might reveal more information about the progression of diseases, or trend of sales of a product.
We are going to use the ebola_sierraleone_2014 data set from the {outbreaks} package to explore historical Ebola outbreak data and demonstrate this concept.
Run the code chunk below to create an R object called ebola_data.
With head(ebola_data) function, you can explore the initial rows of the data set.
#Create an R object named ebola_data
ebola_data <- outbreaks::ebola_sierraleone_2014
head(ebola_data) #show the first 6 rows of the ebola_data## id age sex status date_of_onset date_of_sample district chiefdom
## 1 1 20 F confirmed 2014-05-18 2014-05-23 Kailahun Kissi Teng
## 2 2 42 F confirmed 2014-05-20 2014-05-25 Kailahun Kissi Teng
## 3 3 45 F confirmed 2014-05-20 2014-05-25 Kailahun Kissi Tonge
## 4 4 15 F confirmed 2014-05-21 2014-05-26 Kailahun Kissi Teng
## 5 5 19 F confirmed 2014-05-21 2014-05-26 Kailahun Kissi Teng
## 6 6 55 F confirmed 2014-05-21 2014-05-26 Kailahun Kissi Teng
The names(ebola_data) function would show the names of the different columns.
names(ebola_data)## [1] "id" "age" "sex" "status" "date_of_onset" "date_of_sample" "district" "chiefdom"
It seems intuitive that ebola_data$date_of_onset and ebola_data$date_of_sample are columns containing date information.
Let us confirm that by checking the class of the columns.
class(ebola_data$date_of_onset)## [1] "Date"
class(ebola_data$date_of_sample)## [1] "Date"
Here is a monthly breakdown of a past epidemic of Ebola. We see that there is a peak in October-December.
To create the bar chart of monthly totals of new Ebola cases above, we would use the floor_date() from the {lubridate} package.
lubridate::(floor_date) takes a date-time object and rounds it down to the nearest boundary of the specified unit.
The unit argument of lubridate::(floor_date) can specify either one of the following as character strings.
(Second, minute, hour, day, week, month, bimonth, quarter, season, halfyear and year.)
For example, let’s apply floor_date() to today’s date.
floor_date() takes the current date provided by the function lubridate::today() and rounds it down to the first day of the particular month as shown below.
lubridate::today() #gives the current date in yyyy-mm-dd format## [1] "2022-06-22"
lubridate::floor_date(lubridate::today(), unit = "months")## [1] "2022-06-01"
What would be the result if the function lubridate::floor_date(x, unit= "months") is applied to Valentine’s day 2022?
In other words, round 2022-02-14 down to the first day of the particular month in the year 2022.
q1 <- "YOUR ANSWER HERE"
.check_q1()## Correct! You are wicked!
## 1 2
.hint_q1()## In May 2014, there were 57 Ebola cases, 243 Ebola cases in June 2014
## and 351 Ebola cases in July 2014
## [1] "2021-02-01"
## [1] "2022-02-01"
## [1] "2022-03-01"
2.1.1 Monthly cases
Let us get the number of total Ebola cases per month.
Assumptions
To answer this question, we assume that each individual date_of_onset entry represents a unique case of Ebola.
Thus, to get the total unique cases per month, we would use the {dplyr} function mutate() and the lubridate::floor_date() to modify the date of onset column and separate the data by months.
That is what this section of code does.
dplyr::mutate(date_of_onset = lubridate::floor_date(date_of_onset, unit = "months")) Then, the following section of code count(date_of_onset) counts the number of unique instances of date_of_onset grouped by months.
To ensure that, all possible dates are included in the new column month_onset, use the complete() function from the {tidyr} package to create a sequence of dates and the argument fill=list(n=0)sets to zero (0) any dates that did not have corresponding inputs (cases).
Run ?complete for more details.
Without this step, any graph or table you create may not display time units (that is months) with zero reported cases.
The code below follows the sequence described to get the total monthly cases
total_monthly_cases <- ebola_data%>%
mutate(month_onset = lubridate::floor_date(date_of_onset, unit = "months"))%>%
# new column, 1st of month of onset
count(month_onset) %>% # count cases by month
tidyr::complete(
month_onset = seq.Date(
min(month_onset, na.rm=T),# include all months with no cases reported
max(month_onset, na.rm=T),
by="month"),
fill = list(n = 0))We are using mutate, group_by, count and filter functions from the {dpylr} package (part of the tidyverse packages) for data wrangling and the {ggplot2} package for plotting graphs.
2.1.1.1 Plotting Monthly case totals
Run the following code to plot the monthly Ebola case totals
total_monthly_cases%>%
ggplot(aes(x=month_onset,y=n,fill= month_onset))+geom_col()+
scale_x_date(date_breaks= "2 month",date_labels = "%B %Y")+
guides(x = guide_axis(angle = 45))+
labs(title= "A Bar chart showing the counts of Ebola cases per month")+
ylab("Monthly Case count")+xlab("Month in which cases occurred")+
geom_text(aes(label=n),angle=30,check_overlap= T)# to display case counts as labelsguides( x = guide_axis(angle = 45) is a {ggplot2} function that can be used to change the direction of the axis labels on a graph.
In the code above we use it to create diagonal axis labels by indicating angle= 45.
Refer to the graph above and find the average number of Ebola cases (mean) for the 3 month period from May 2014 to July 2014. Select the correct answer from the options below.
q1 <- "YOUR ANSWER HERE"
.check_q2()
.hint_q2()## [1] 217
## [1] 362.3333
## [1] 782.6667
2.1.2 Daily Cases
Now imagine we were monitoring the cases real-time in 2014 at the very beginning of the epidemic, a case of Ebola has been declared and we need to start looking day by day if we have a growing epidemic on our hands.
Let’s follow the code below to plot the daily case rate over May 2014 to see the progression of the spread of Ebola.
daily_cases <- ebola_data %>%
count(date_of_onset) %>% # count number of rows per unique date
complete( # ensure all days appear even if no cases
date_of_onset = seq.Date( # re-define date column as daily sequence of dates
from = min(date_of_onset, na.rm=T),
to = max(date_of_onset, na.rm=T),
by = "day"),
fill = list(n = 0))
daily_cases## # A tibble: 483 × 2
## date_of_onset n
## <date> <int>
## 1 2014-05-18 1
## 2 2014-05-19 0
## 3 2014-05-20 2
## 4 2014-05-21 4
## 5 2014-05-22 6
## 6 2014-05-23 1
## 7 2014-05-24 2
## 8 2014-05-25 0
## 9 2014-05-26 10
## 10 2014-05-27 8
## # … with 473 more rows
The table above shows a sequence of dates from May 2014 to September 2015 and their associated record of new cases of Ebola. To plot the trend of cases in the first month of the epidemic, we would have to isolate the cases that occurred in May 2014.
Thus, we would use the lubridate::month() function to create a new column containing only the month component of the date_of_onset.
Also, the lubridate::year() will be used to extract the year column.
2.1.2.1 Plotting daily case totals
daily_cases%>%
mutate(month_onset=lubridate::month(date_of_onset, label = TRUE, abbr = FALSE),
year_onset= lubridate::year(date_of_onset))%>%
filter(month_onset=="May" & year_onset== 2014)%>%
ggplot(aes(x=date_of_onset,y=n,fill= date_of_onset))+geom_col()+
scale_x_date(date_breaks= "1 day",date_labels = "%d %b %Y")+
guides(x = guide_axis(angle = 45))+
labs(title= "A Bar chart showing the daily counts of Ebola cases in May 2014")+
ylab("Daily Case count")+
geom_text(aes(label=n),check_overlap= T, vjust= -0.2)2.1.3 Weekly Case counts
The plot above shows the trend of the first days of the Ebola outbreak in May 2014.
If we wanted to get weekly totals of cases in May and June 2014, we would use the code below.
weekly_counts <-
ebola_data%>%# remove cases missing date_onset
mutate(week_onset = lubridate::floor_date(date_of_onset, unit = "week")) %>% # new column of week of onset
count(week_onset) %>% # group data by week and count rows per group
tidyr::complete( # ensure all days appear even if no cases
week_onset = seq.Date(# re-define date column as daily sequence of dates
from = min(week_onset, na.rm=T),
to = max(week_onset, na.rm=T),
by = "week"),
fill = list(n = 0))
weekly_counts%>%
mutate(year_onset=lubridate::year(week_onset),
month_onset=lubridate::month(week_onset, label= TRUE,
abbr= FALSE))%>%
filter(month_onset %in% c("May","June","July")&
year_onset==2014)%>%
ggplot(aes(x=week_onset, y= n, fill= month_onset)) +geom_col()+
scale_x_date(date_breaks = "1 week", date_labels = "%d %b %Y")+
labs(title="Ebola cases counts over time (May, June , July 2014)")+
ylab("Weekly Case_count")+
geom_text(aes(label=n),check_overlap= T, vjust= -0.2)We can see that generally there is an increase in case counts with a sharp increase in cases in the week of 9th June, 2014.
Contributors
The following team members contributed to this lesson:
References
Some material in this lesson was adapted from the following sources:
Batra, Neale, et al. (2021). The Epidemiologist R Handbook. Chapter 28: GIS Basics. (2021). Retrieved 01 April 2022, from https://epirhandbook.com/en/gis-basics.html
Wickham, Hadley, and Garrett Grolemund.”R For Data Science.” 16 Dates and times | R for Data Science, 2017. Accessed May 10, 2022 https://r4ds.had.co.nz/dates-and-times.html#time-zones.
3 Data cleaning: Introduction
3.1 Introduction to the lesson
Data cleaning is one of the important steps in data analysis. Data cleaning is the process of transforming dirty data into reliable data that can be analyzed.
It involves identifying inaccurate, incomplete, or improbable data and resolving potential data inconsistencies or errors to improve your data quality and subsequently, and overall productivity.
3.2 How do you clean data
While every dataset requires different techniques to clean dirty data, there is a need to address these issues in a systematic way.
It’s important to conserve as much of the original data as possible while also ensuring that you end up with a clean dataset.
In practice, the cleaning process may focus on finding and resolving data points that don’t agree or fit with the rest of the dataset in more obvious ways. In particular, the data cleaning process may include(but is not necessarily limited to):
- Formatting data frame column names
- Deleting all blank rows/columns
- Removing duplicate rows
- Checking for irrelevant Observations
- Converting data types into their appropriate types for analysis
- String normalization
In this chapter, we will go through the process of converting messy data, into clean or reliable data that can be analyzed in R. Multiple packages are available in R to clean the data sets. For this course, we will predominantly make use of functions from the tidyverse family of R packages, as well as the janitor and linelist packages.
This is the flow we will follow for this chapter:
We will familiarize ourselves with the data set here in this introduction lesson
We will check for structural errors in lesson 1 (cleaning column/variable names) and lesson 2 (removing empty rows/columns)
We will check for data irregularities in lesson 3 (deduplicating data entries), in lesson 4 (monitoring types and encodings), and in lesson 5 (managing missing values)
For the BEST data cleaning, you need to know what mess / what errors you are hunting for by EXPLORING the dataset, then you CLEAN.
3.3 The Real Messy Data
In terms of the data, we will use a familiar dataset, the Yaounde dataset, which gives the results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020.
It should be noted, however, that the version of the Yaounde dataset we have encountered thus far in previous lessons had already gone through the process of data cleaning.
Now, we will look at the raw data (the original data collected) to learn how to clean a messy data set. In addition, some parts of the Yaounde data set have been altered to further illustrate the data cleaning process.
The full dataset can be obtained from Zenodo, and the paper can be viewed here.
While this is not the type of data we will work with for the rest of this course, the cleaning steps we will perform here are good practice for working with real-world linelists.
For practice, we will also make use of data from a case control study investigating the temporal, spatial and household dynamics of typhoid fever in Kasese district, Uganda.
3.4 Importing and reviewing the data
yaounde <- read_csv(here::here('ch02_data_cleaning_pipeline/data/yaounde_data.csv'))To get a detailed overview of each of the variables in your dataset, there are a number of functions you could use. We recommend the use of the skim() function from the {skimr} package. This function provides an overview of the data frame and a summary of every column (by class/type).
Specifically, it provides:
- An overview of the rows and columns of the dataframe
- The data type for each variable:
skim_type - The number of missing entries (entries per variable that are
NA):n_missing - The completeness rate for each variable (a number from 0 to 1 corresponding to the ratio: missing observations / number of observations):
complete_rate - A set of summary statistic; the mean, standard deviation and the five number summary for numerical variables and the frequency and proportions for categorical variables
- Spark histograms and line graphs for the numerical variables
If your “missing” variable is a “0” or a blank space, the n_missing and complete_rate outputs of skim() will not pick up on these missing forms.
skimr::skim(yaounde)skimr output, part 1
skimr output, part 2
skimr output, part 3
You can also specify the specific variables to be included, similar to dplyr::select()
skimr::skim(yaounde,SEX,AGE)| Name | yaounde |
| Number of rows | 981 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| SEX | 2 | 1 | 1 | 6 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| AGE | 2 | 1 | 29.01 | 17.33 | 5 | 15 | 26 | 39 | 79 | ▇▇▃▂▁ |
The format of the results are a single wide data frame combining the summary results, with some additional attributes and two metadata columns:
skim_variable: name of the original variableskim_type: class of the variable
We can also use some dplyr verbs like select on skim output. We only need to specify the additional attributes mentioned above.
Use
skim()to obtain a detailed overview of the Ugandantyphoiddataset.Use
skim()to obtain a detailed overview of the Ugandantyphoiddataset. However, this time, include only the variablesAgeandLevelofeducation.
To just get information on a variable without displaying the type of variable
skim(yaounde) %>%
select(skim_type) %>%
count(skim_type)## Error in `select()`:
## ! Can't subset columns that don't exist.
## ✖ Column `skim_type` doesn't exist.
Overall, the output displayed above highlights a few key features about our data frame
- There are 981 rows and 36 columns.
- There are 20 character variables, 15 numeric variables and 1 logical variable.
In addition, looking at the output from the skim() function immediately brings to light some potential issues in the data that may need to be addressed before the data are used in an analysis.
We can identify the following issues with our data :
there is an empty column in the data (the logical column,
NAhas acomplete_rateof 0)names of variables are unclear/unclean (
is drug_parachas a whitespace in its name,treatment..combinationshas a special character,.., etc.)several variables are heavily incomplete (such as
sequelae,consultation)distributions are skewed for certain variables (such as
AGE,height cm)etc…
- What other potential issues from the
skim()output may need to be addressed before the data are used in an analysis?
Contributors
The following team members contributed to this lesson:
References
Some material in this lesson was adapted from the following sources:
Batra, Neale, et al. The Epidemiologist R Handbook. 2021.Cleaning data and core functions. https://epirhandbook.com/en/cleaning-data-and-core-functions.html#cleaning-data-and-core-functions
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/ (website), https://github.com/ropensci/skimr/.
4 Data cleaning: tidying column names
4.1 Learning objectives
- You can use
janitor::clean_names()to column names automatically. - You can use
rename()andrename_with()to clean column names manually.
4.2 Intro to the lesson
In R, column names are the “header” or “top” value of a column. They are used to refer to columns in the code, and serve as a default label in figures. They should have “clean”, standardized syntax so that we can work with them, and so that our code can be readable to other coders.
Ideally, column names:
- should be short
- should have no spaces or periods(space and periods should be replaced by underscore “_”)
- should have no unusual characters(&, #, <, >)
- should have a similar style
In this lesson, we will explore how to clean column names manually and automatically in R.
As a reminder, the data we are cleaning is the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
yaounde <- read_csv(here::here('ch02_data_cleaning_pipeline/data/yaounde_data.csv'))4.3 Columns of our Data
We can use the names() function from base R. Or return to this chapter’s intro to have a look at the output of skim()
names(yaounde)## [1] "id ind" "AGE" "AGE.CATEGORY" "SEX" "EDUCATION"
## [6] "OCCUPATION" "weight kg" "height cm" "is.smoker" "is.pregnant"
## [11] "is.medicated" "household with_children" "breadwinner" "source of_revenue" "has contact_COVID"
## [16] "igg.result" "igm result" "symptoms.." "consultation" "treatment..combinations"
## [21] "drugsource" "hospitalised" "sequelae" "respiration frequency." "is drug_parac"
## [26] "is drug_antibio" "is drug_hydrocortisone" "is drug_other_anti_inflam" "is drug_antiviral" "is drug_chloro"
## [31] "is drug_tradn" "is drug_oxygen" "is drug_other" "is drug_no_resp" "is drug_none"
## [36] "NA"
We can see that:
- some names contain spaces
- some names contain special characters such as
.. - some names are in upper case while some are not
- Display the names of the
typhoiddataset.
4.4 Automatically clean column names
4.4.1 janitor::clean_names()
A handy function for standardizing column names is the clean_names() from the {janitor} package.